Briefly describe the data set. What information does it contain?
The dataset contains historical data on small/large/xl avocado prices and sales volume from multiple cities, states, and regions in the USA. The data includes records from 2015-2020.
Which major geographical region sold the most total organic, small Hass avocados in 2017?
# Convert the 'date' column to a datetime objectavocado_clean["date"] = to_datetime(avocado_clean["date"])# Filter the data for the year 2017 and organic typeorganic_2017 = avocado_clean[(avocado_clean["year"] ==2017) & (avocado_clean["type"] =="organic")]#Filter to only include metro regionssmallavo_organic2017 = organic_2017[organic_2017["region_type"]=="Metro Region"]#Filter by geography and small avocadossmallavo_organic2017 = smallavo_organic2017.groupby("geography")["small_avo"].sum()smallavo_organic2017
#Find the region with the most salessmallavo_organic2017_geography = smallavo_organic2017.idxmax()print(f"As shown above, the {smallavo_organic2017_geography} sold the most total organic, small Hass avocados in 2017.")
As shown above, the Dallas/Ft. Worth sold the most total organic, small Hass avocados in 2017.
As shown above, the Dallas/Ft. Worth Metro Region sold the most total organic, small Hass avocados in 2017.
Split the date variable into month, day, and year variables. In which month is the highest average volume of avocado sales?
#Date is already converted from last example so extract month and day into seprate columns.avocado_clean["month"] = avocado_clean["date"].dt.monthavocado_clean["day"] = avocado_clean["date"].dt.day
#Prevent from displaying data in scientific notationset_option('display.float_format', '{:.2f}'.format)#Group by month and calculate average total volume avocado_clean_m = avocado_clean.groupby("month")["total_volume"].mean()avocado_clean_m
#Display max monthavocado_clean_m_max = avocado_clean_m.idxmax()print(f"The month with the highest average total_volume is {avocado_clean_m_max}, or the month of May")
The month with the highest average total_volume is 5, or the month of May
Which metro area geographical regions sold the most total avocados? Plot side-by-side box-plots of the total volume for only the five metro geographical regions with the highest averages for the total_volume variable.
#Filter to only show Metro Regionsavocado_metroregion = avocado_clean[avocado_clean["region_type"]=="Metro Region"]#Group by Metro Region average total volumeavovolume_metroregion = avocado_metroregion.groupby("geography")["total_volume"].mean().reset_index()# Sort by average total volumetop5avgvolume = avovolume_metroregion.sort_values(by="total_volume", ascending=False).head(5)#Displays the mean of total volume based on Metro Regiontop5avgvolume
geography
total_volume
3
Dallas/Ft. Worth
712751.90
7
Phoenix/Tucson
625038.26
10
West Tex/New Mexico
460774.33
0
Baltimore/Washington
452229.79
6
Northern New England
256433.89
#Group by Metro Region average total volumeavovolume_metroregion = avocado_metroregion.groupby("geography")["total_volume"].mean().reset_index()# Sort by average total volumetop5avgvolume = avovolume_metroregion.sort_values(by="total_volume", ascending=False).head(5)# Filter the dataset to include only data for the top five regionsfiltered_avo = avocado_metroregion[avocado_metroregion["geography"].isin(top5avgvolume["geography"])]# Create plotregion_volume_plot = ( ggplot(filtered_avo, aes(x="geography", y="total_volume",fill ="geography"))+ geom_boxplot()+ labs(title="Total Avocado Volume by Top 5 Metro Regions", x="Metro Region", y="Total Volume")+ theme(axis_text_x=element_text(rotation=45, hjust=1)))region_volume_plot
<Figure Size: (640 x 480)>
Pivoting
From your cleaned data set, create a data set with only these California regions and answer the following questions about these California regions only.
#Create a data set with only the California Regionsavocado_cali = avocado_clean[avocado_clean['geography'].isin(['San Francisco', 'San Diego', 'Sacramento', 'Los Angeles'])]
In which California regions is the price of organic versus conventional avocados most different? Support your answer with a few summary statistics AND a visualization.
#Filter for seperate dataframescali_organic = avocado_cali[avocado_cali["type"]=="organic"]cali_conventional = avocado_cali[avocado_cali["type"]=="conventional"]# Calculate summary statistics for organic and conventional avocado prices by geographycali_organic_stat = cali_organic.groupby("geography")["average_price"].mean().reset_index()cali_conventional_stat = cali_conventional.groupby("geography")["average_price"].mean().reset_index()# Merge the organic and conventional summariescali_merge = merge(cali_organic_stat, cali_conventional_stat, on="geography", suffixes=("_organic", "_conventional"))# Calculate the price difference between organic and conventional avocadoscali_merge["price_difference"] = cali_merge['average_price_organic'] - cali_merge['average_price_conventional']# Create a bar plot to visualize the price differencecali_price_plot = ( ggplot(summary_df, aes(x="geography", y="price_difference",fill="geography")) + geom_bar(stat="identity",alpha=0.7) + theme(axis_text_x=element_text(angle=0, hjust=0.5)) + labs(x='City in California', y='($) Price Difference (Organic - Conventional)', title="Price Difference Between Avocados in California by City"))cali_price_plot
<Figure Size: (640 x 480)>
cali_merge
geography
average_price_organic
average_price_conventional
price_difference
0
Los Angeles
1.57
1.05
0.53
1
Sacramento
1.87
1.30
0.58
2
San Diego
1.80
1.11
0.68
3
San Francisco
2.12
1.40
0.72
San Francisco has the largest price difference on average between organic and conventional avocados on average.
The following plot shows, for all four California regions, the proportion of the average Hass avocado sales that are small, large, or extra large; conventional vs. organic. Recreate the plot; you do not have to replicate the exact finishing touches - e.g., color, theme - but your plot should resemble the content of this plot.
#Rename columns for graphavocado_cali_8 = avocado_cali.rename(columns={("small_avo"): "Small","large_avo": "Large","xl_avo": "Xlarge"})#Convert to long to add column for sizeavo_cali_long = melt(avocado_cali_8, id_vars=["geography","type"], value_vars=["Small","Large","Xlarge"],var_name="size",value_name="amount")#Calculate average after grouping by city, type, and sizeavo_cali_long = avo_cali_long.groupby(["geography","type","size"]).mean()#Convert back to df reset indexavo_cali_long = avo_cali_long.reset_index()avo_cali_long.head()
geography
type
size
amount
0
Los Angeles
conventional
Large
443842.31
1
Los Angeles
conventional
Small
1172952.34
2
Los Angeles
conventional
Xlarge
97073.64
3
Los Angeles
organic
Large
20804.22
4
Los Angeles
organic
Small
14442.06
#Define colors for plotsize_colors = {'Small': 'orangered','Large': 'mediumseagreen','Xlarge': 'rebeccapurple',}#Create Plotavosales_size = (ggplot(avo_cali_long) + geom_col(aes(x="geography",y="amount",fill="size"),position="fill")+ facet_wrap("type") + scale_fill_manual(values = size_colors)+ labs(x='Region of California',y='Proportion',title="Proportion of Average Hass Avocado Sales by Size")+ scale_y_continuous(labels=["0%","25%","50%","75%","100%"])+ theme_light()+ theme(axis_text_x=element_text(rotation=45, hjust=1)) )avosales_size
<Figure Size: (640 x 480)>
Using Outside Data
# Load new dataset with calculated mean from redfin valuescalihomes_avg_price = read_csv(r"C:\Users\broga\OneDrive\Desktop\MSBA\Computing_ML\Labs\Lab_2\AvgCaliforniaHousing_Price.csv")#Merge dataset on geography (city) column avocado_avgprice = avocado_cali.groupby("geography")["average_price"].mean().reset_index()avo_avgprice_housing = avocado_avgprice.merge(calihomes_avg_price, on=["geography"])avo_avgprice_housing
geography
average_price
housing_price_avg
0
Los Angeles
1.31
1775644.00
1
Sacramento
1.58
471129.00
2
San Diego
1.46
1539145.00
3
San Francisco
1.76
1978296.00
#Plotnine and Seaborn do not support dual axis graphs so decided to use plot.ly#https://plotly.com/python/import plotly.graph_objects as go# Create the dual y-axis plotfigure_housing_price = go.Figure()# Add the "average_price" trace with the first y-axisfigure_housing_price.add_trace(go.Scatter(x=avo_avgprice_housing["geography"], y=avo_avgprice_housing["average_price"], mode="lines+markers", name="Average Avocado Price"))# Add the "housing_price_avg" trace with the second y-axisfigure_housing_price.add_trace(go.Scatter(x=avo_avgprice_housing["geography"], y=avo_avgprice_housing["housing_price_avg"], mode="lines+markers", name="Housing Price Average", yaxis="y2"))# Create plot with dual y-axisfigure_housing_price.update_layout( title="Average Avocado Price vs. Housing Price Average by City", xaxis=dict(title="City in California"), yaxis=dict(title="Average Avocado Price ($)", side='left', showgrid=False), yaxis2=dict(title='Housing Price Average ($)', overlaying='y', side='right', showgrid=False),)# Show the plotfigure_housing_price.show(renderer='notebook')
For San Diego and San Francisco, there seems to be a relationship betweeen average avocado price and average house price, althogh Sacramento and Los Angeles do not appear to have the same relationship. A multitude of factors could be responsible for LA and Sacramento, including the size & diversity of LA’s population. Despite the glamorous stereotype surrounding LA, the entirety of the region/city brings down the average house price. Sacramento follows the same story, as the average housing price is much cheaper than San Diego and San Francisco, but still yields releatively high costs for avocados. Ultimately, we are unable to draw a general conclusion that average avocado price has a strong relationship with average housing price by city in California. It would be interesting to explore the relationship with a smaller explanatory variable that is not skewed from a massive, diverse population such as LA.